ABSTRACT

The regions of the polypeptide chain immediately
preceding or following an α-helix are known as Nt
and Ct cappings, respectively. Cappings play a
central role in stabilizing α-helices because the
first and last turns lack intrahelical hydrogen bonds.
Sequence patterns of amino acid type preferences
have been derived for cappings, but the structural
motifs associated with them remain unclassified.
CAPS-DB is a database of clusters of structural
patterns of different capping types. The clustering
algorithm is based on the geometry and the (φ-ψ)-space
conformation of these regions. CAPS-DB is a
relational database that allows the user to search,
browse, inspect and retrieve structural data
associated with cappings. The contents of CAPS-DB
might be of interest to a wide range of scientists
covering different areas such as protein design and
engineering, structural biology and bioinformatics.
The database is accessible at:
http://www.bioinsilico.org/CAPSDB.

INTRODUCTION 

The first and last turns of α-helices lack intrahelical
hydrogen bonds, so their Nt and Ct ends must be
stabilized by cappings to avoid fraying. Capping motifs were
first described by Richardson and Rose in 1988 (1,2)
as formed by interactions of the last residues of the helix
with the closest residues of the polypeptide sequence.
Subsequent works extended the classification of helical
Ct (3,4) and Nt cappings (5-9), and terms such as the
'Schellman', capping-box, big-box or αL capping motifs
were introduced. A more systematic classification of
cappings was presented by Aurora and Rose (10), who analysed
a large data set of protein structures and classified
helix-capping motifs according to specific patterns of hydrogen
bonding and hydrophobic interactions found at or near
the ends of helices in both proteins and peptides.
Consequently, cappings play an important role in the
stability (11-16) and function of proteins (17). Their functional
relevance is in general associated with the structural
stabilization of a protein or fold (18), which occasionally may
imply associated disorders or diseases [e.g.
in prion-protein stability (19) or diabetes mellitus
produced by a misfolded transcription factor (20)].

Although there have been extensive studies of the
sequence preferences and atomic interactions in cappings,
there has not been a systematic classification of the
structural patterns associated with cappings. We have used the
definition of the topological features of the classification
of loops to cluster the conformations of the polypeptide
chain forming the capping in order to automatically
extend the pre-genomic era classification. As a result, a
novel and automated structural classification and
database of helix cappings, CAPS-DB, is presented.
Cappings were extracted from high-quality protein
structures and structurally clustered based on geometry and
conformation (φ-ψ angles). The clustering process does
not require the structural alignment of cappings but a
two-step process in which both geometry and conformation
are compared. This procedure allows for a post-clustering
quality check based on Root Mean Square Deviation
(RMSD) values. The entire process, from extraction to
clustering and archiving, is fully automated, which allows
CAPS-DB to be updated on a regular basis. The
current version of CAPS-DB, v. 2.0, contains 16195 Nt
cappings and 16188 Ct cappings extracted from 3848
protein structures that have been classified into 905
clusters. CAPS-DB features a clear and intuitive web
interface that allows users to search, browse and retrieve data
easily and conveniently.

DEFINITION OF HELIX-CAPPING MOTIFS AND
GEOMETRY 

The definition of capping used in this work falls within the
short- and medium-range capping interaction categories
proposed by Aurora and Rose (10). Thus, the residues
before a helix define an Nt capping where the first residue of
the capping interacts with residue(s) in the first turn of the
helix. Conversely, a Ct capping is defined by the portion of
the polypeptide chain between the end of a helix and the
last residue that interacts with any group in the last turn of
the helix (Figure 1A).

The concept of geometry derives from our previous
work in loop structure classification (21,22). In the case
of cappings, the geometry characterizes the topology and
is defined by two internal coordinates: a distance and an
angle. The distance D is defined as the Euclidean distance
between the first (Nt capping) or last (Ct capping) Cα of
the α-helix and the first (Nt capping) or last (Ct capping)
sequentially adjacent Cα that is within atomic interaction
distance of any of the residues in the first (N1, N2, N3,
N4) or last (C-1, C-2, C-3, C-4) turn of the α-helix
[nomenclature for helix residues as defined by Aurora and
Rose (10)]. The delta angle is defined as the angle
between the vector defined by the moment of inertia M
of the α-helix and the distance vector D (Figure 1A). The
range of parameter D depends on the number of
residues that form the capping, being typically <16 Å,
while the delta angle ranges between 0 and 180°.
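
The two descriptors defined above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions, not the CAPS-DB implementation: the helix axis M is
approximated by the principal axis of the helix Cα coordinates
(found by power iteration), it is oriented from the first to the
last Cα, and the distance vector points from the helix terminus
to the capping residue. All function names are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def helix_axis(ca_coords, iterations=50):
    """Unit vector along the principal axis of the helix C-alpha atoms.

    Assumption: this axis stands in for the 'moment of inertia' vector M.
    """
    n = len(ca_coords)
    centroid = tuple(sum(c[i] for c in ca_coords) / n for i in range(3))
    centered = [_sub(c, centroid) for c in ca_coords]
    # 3x3 scatter matrix of the centred coordinates
    cov = [[sum(p[i] * p[j] for p in centered) for j in range(3)]
           for i in range(3)]
    end_to_end = _sub(ca_coords[-1], ca_coords[0])
    v = end_to_end
    for _ in range(iterations):  # power iteration towards the top eigenvector
        w = tuple(sum(cov[i][j] * v[j] for j in range(3)) for i in range(3))
        length = _norm(w)
        if length == 0.0:
            break
        v = tuple(x / length for x in w)
    if _dot(v, end_to_end) < 0:  # orient the axis from first to last residue
        v = tuple(-x for x in v)
    length = _norm(v)
    return tuple(x / length for x in v)

def capping_geometry(helix_ca, helix_end_ca, capping_end_ca):
    """Return (D in angstroms, delta in degrees) for one capping."""
    d_vec = _sub(capping_end_ca, helix_end_ca)
    dist = _norm(d_vec)
    if dist == 0.0:
        raise ValueError("helix terminus and capping residue coincide")
    cos_delta = _dot(helix_axis(helix_ca), d_vec) / dist
    cos_delta = max(-1.0, min(1.0, cos_delta))  # guard against rounding
    return dist, math.degrees(math.acos(cos_delta))
```

Here helix_end_ca is the first (Nt capping) or last (Ct capping)
Cα of the helix and capping_end_ca is the Cα of the capping
residue that delimits the capping.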



CLUSTERING ALGORITHM 

The clustering algorithm operates at two different levels:
geometry and conformation. In the first stage, cappings
are merged into groups that share the same geometry. Two
cappings have the same geometry if their Δ(D, delta)
values belong to the two-dimensional semi-open interval
I = [(0,0), (4,45)). The partition of the geometrical space
was optimized via a comprehensive exploration of
different bin definitions. During the second stage, cappings with
the same geometry are clustered based on a conformation
similarity score. The conformation of the residues within
the capping is encoded by assigning a conformation code
that depends on the φ-ψ angles according to the partition
of the Ramachandran space shown in Figure 1B. Thus,
the conformation of a given capping is encoded by a string
of characters, each of which describes a precise region of
the (φ-ψ)-space. The clustering method is based on a
density search algorithm (23). Because no structural
superposition of cappings is required during the clustering,
a structural similarity measure, the RMSD of main-chain
atoms, is used to assess the quality of the clusters (see next).
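
The two levels described above can be illustrated with a short
Python sketch. It assumes that the geometrical space is a regular
grid of 4 Å by 45° bins, which is one reading of the interval
given in the text. The φ-ψ region boundaries in
conformation_code are placeholders chosen for illustration and
are not the partition of Figure 1B, and the fraction of identical
codes is only a stand-in for the conformation similarity score;
the density search itself is not reproduced here.

```python
from collections import defaultdict

D_BIN = 4.0       # bin width for D in angstroms (interval stated in the text)
DELTA_BIN = 45.0  # bin width for the delta angle in degrees (same source)

def geometry_bin(distance, delta):
    """Map (D, delta) to a pair of bin indices on a regular grid."""
    d_index = int(distance // D_BIN)
    last_delta_bin = int(180.0 // DELTA_BIN) - 1
    # delta == 180 degrees is folded into the last bin
    delta_index = min(int(delta // DELTA_BIN), last_delta_bin)
    return d_index, delta_index

def conformation_code(phi, psi):
    """One-letter code for a (phi, psi) pair, angles in degrees.

    PLACEHOLDER boundaries, not the published partition.
    """
    if phi < 0:
        if -120 <= psi < 50:
            return "a"                      # alpha-helix-like region
        return "b" if phi < -100 else "p"   # extended regions
    if -50 <= psi < 100:
        return "l"                          # left-handed helix-like region
    return "e"

def encode_capping(torsions):
    """Encode a capping as a string, one character per residue."""
    return "".join(conformation_code(phi, psi) for phi, psi in torsions)

def group_by_geometry(cappings):
    """First stage: group (name, D, delta, torsions) records by geometry bin."""
    groups = defaultdict(list)
    for name, distance, delta, torsions in cappings:
        key = geometry_bin(distance, delta)
        groups[key].append((name, encode_capping(torsions)))
    return dict(groups)

def conformation_similarity(code_a, code_b):
    """Stand-in score: fraction of positions with identical codes."""
    if not code_a or len(code_a) != len(code_b):
        raise ValueError("cappings must be non-empty and of equal size")
    return sum(a == b for a, b in zip(code_a, code_b)) / len(code_a)
```

Because both stages reduce to comparing bin indices and strings,
no coordinate superposition is needed at this point.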



QUALITY OF CLUSTERS 

The clustering algorithm is based on geometry and a density
search on the (φ-ψ)-space, and is therefore independent of
structural alignment between cappings. The RMSD
within and between cappings of the same and different
clusters is then used to assess the quality of the clustering.
Figure 2 shows the distribution of RMSD (main chain)
upon structural alignment of cappings that belong to the
same (w) or different (b) clusters. Among cappings of the
same size, and for all capping sizes, RMSD values are
significantly smaller for cappings that belong to the
same cluster (w) than for those that belong to different
clusters (b). This result shows that the
clustering algorithm is indeed grouping structurally
related cappings. The clustering relies on the comparison of
string values (i.e. geometry and conformation) and does
not require complicated and computationally intensive
operations such as the calculation of rotation/translation matrices
(e.g. for structure superposition); it is thus
very fast and well suited for large-scale analyses. Because
the clustering of cappings is the central element,
keeping CAPS-DB up to date requires neither
extensive nor intensive computing, which ensures its
long-term viability.

Figure 1. Definition of capping, geometry of cappings and partition and encoding of phi/psi angles. (A) Example of an Nt capping and geometry
descriptors. (B) Encoding of the capping conformation as a string of characters using the partition of phi/psi space. The most accessible regions of
the phi/psi space are encoded as follows: a: α-helix; l: left-handed α-helix; b: β-strand; p: β-proline; g: γ-helix; e: ε.

Figure 2. Box plots of RMSD values as a function of capping size within (w) and between (b) clusters. The central horizontal line in the box marks
the median and the box edges the first and third quartiles; error bars show minimum and maximum values. Box plots for Nt and Ct cappings are
shown in the upper and lower plot, respectively.
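
The within (w) versus between (b) comparison used in this
section can be sketched as follows. The sketch assumes that
pairwise main-chain RMSD values have already been computed by
some superposition tool and that all cappings passed in have the
same size; the function names and the input layout are
hypothetical.

```python
from itertools import combinations
from statistics import median

def split_rmsd(labels, rmsd):
    """Split pairwise RMSD values into within- and between-cluster lists.

    labels: dict mapping capping id -> cluster id
    rmsd:   dict mapping frozenset({id_a, id_b}) -> RMSD in angstroms
    """
    within, between = [], []
    for a, b in combinations(sorted(labels), 2):
        value = rmsd[frozenset((a, b))]
        if labels[a] == labels[b]:
            within.append(value)
        else:
            between.append(value)
    return within, between

def median_summary(labels, rmsd):
    """Return (median within, median between); None if a list is empty."""
    within, between = split_rmsd(labels, rmsd)
    w = median(within) if within else None
    b = median(between) if between else None
    return w, b
```

A clustering of good quality would be expected to give a within
median clearly below the between median for every capping size.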



DATABASE IMPLEMENTATION AND CONTENTS 

CAPS-DB comprises two major components: a relational
database for data storage and management and a web
application to interface with the database. Data are stored in a
relational MySQL database whose design was optimized
to provide fast access to the information. It
makes extensive use of master and internal keys and
cross-references between tables. The MySQL server runs
on a dedicated computer that also mirrors all external
databases that are required during the updating and
annotation process, e.g. PDB (24). The web application runs on
an Apache web server hosted on a Red Hat® Enterprise
Linux operating system. CGI Perl, JavaScript and
DBI-DBD modules are used to interface with and access the
database. Web pages resulting from queries and sequence
searches are generated on the fly, i.e. are dynamic, thus
ensuring that up-to-date information is available to users. The
website includes a Jmol applet (http://jmol.sourceforge.net)
to visualize protein structures and a BLAST (25)
search engine to perform sequence searches (see
'Querying and retrieving data' section). Finally, documentation
and explanations about the contents, the clustering and the use
of CAPS-DB are provided to users in the help section of the
website.

The current version of CAPS-DB, v. 2.0, classifies over
30000 cappings extracted from 3848 protein structures
and grouped into 905 clusters. The process of extraction
and classification of cappings was performed as follows.
An initial set of non-redundant protein structures was
derived from the Protein Data Bank (PDB) (24) using
PISCES (26). The parameters used to generate the initial
set were: crystal resolution better than 2.0 Å, a maximum
R-factor of 0.25, a minimum chain length of 40 residues and
a sequence identity cut-off of 25%. This initial set was subjected to
further quality filters: non-standard amino acids were
discarded, with the exception of Se-Met, which was
converted into Met; for atoms with alternative locations,
only the first rotamer was kept; and amino acids with
insertion codes were renumbered if non-superposable.
Missing main-chain and side-chain atoms were added using
Maxsprout (27) and Scwrl 4.0 (28), respectively. The secondary
structure of proteins was assigned using DSSP (29) and the
atomic interactions between cappings and helices were defined using
CSU (30). Finally, cappings and flanking helices were
extracted and clustered using the approach described
above.
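
The selection thresholds listed above can be expressed as a
simple filter. In CAPS-DB this selection is performed by PISCES;
the sketch below only restates the cut-offs in code, and the
record type, field names and function names are hypothetical.
The 25% sequence identity culling is left to PISCES and is not
reproduced.

```python
from dataclasses import dataclass

@dataclass
class ChainRecord:
    """Minimal, hypothetical description of one protein chain."""
    pdb_id: str
    chain: str
    resolution: float  # angstroms
    r_factor: float
    length: int        # number of residues

def passes_filters(rec, max_resolution=2.0, max_r_factor=0.25,
                   min_length=40):
    """Apply the resolution, R-factor and chain-length cut-offs.

    'Better than 2.0 angstroms' is read here as a strict inequality,
    which is an assumption.
    """
    return (rec.resolution < max_resolution
            and rec.r_factor <= max_r_factor
            and rec.length >= min_length)

def select_chains(records):
    """Keep only the chains that satisfy all three cut-offs."""
    return [rec for rec in records if passes_filters(rec)]
```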



QUERYING AND RETRIEVING DATA 

There are several basic approaches to query and retrieve
data from CAPS-DB. The first approach is simply to
browse the ontology of the classification (Figure 3).
The first level is the capping type: Nt or Ct capping.
The second level is the capping size, i.e. the number of
residues in the capping. Below capping size there are
two more levels that relate to the geometry of the capping:
D and delta angle. Clusters are the lowest level and present the
cappings that are structurally equivalent and other related
information (see 'Cluster information' section). The second
approach to query CAPS-DB is to provide the
PDB identification code of the protein of interest in the
text box embedded in the top menu (Figure 3). If the
protein is classified in CAPS-DB, the server will return a
list of clusters and the associated cappings.

Users can also query CAPS-DB by doing a sequence
search using a BLAST (25) engine implemented in the
server (Figure 3). A sequence identity cut-off value and a
substitution matrix can be selected in the advanced option
menu. The server will return a list of target proteins sorted
by BLAST score, and alignments can be inspected by
following the relevant links (Figure 3). Finally, CAPS-DB
also features an advanced search engine that allows more
complex and elaborate queries, including: searches
by capping type, capping size or capping geometry;
searches using free text or keywords (e.g. reductases),
UniProt accession number (13) or PubMed identifier;
and any combination of the aforementioned queries.
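
To illustrate how the browsing hierarchy (capping type, size, D,
delta angle, cluster) and the PDB-code lookup could map onto
relational queries, the sketch below uses an in-memory SQLite
database. The table name, column names and functions are
hypothetical and do not describe the actual CAPS-DB MySQL
schema or its web interface.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE capping (
               id INTEGER PRIMARY KEY,
               pdb_code TEXT,
               chain TEXT,
               capping_type TEXT,   -- 'Nt' or 'Ct'
               size INTEGER,        -- number of residues in the capping
               d_bin INTEGER,       -- geometry level 1
               delta_bin INTEGER,   -- geometry level 2
               cluster_id INTEGER
           )"""
    )
    return conn

def clusters_for_pdb(conn, pdb_code):
    """Clusters that contain at least one capping from the given entry."""
    rows = conn.execute(
        "SELECT DISTINCT cluster_id FROM capping "
        "WHERE pdb_code = ? ORDER BY cluster_id",
        (pdb_code.lower(),),
    )
    return [row[0] for row in rows]

def browse(conn, capping_type, size, d_bin, delta_bin):
    """Walk the hierarchy down to the list of clusters at one leaf."""
    rows = conn.execute(
        "SELECT DISTINCT cluster_id FROM capping "
        "WHERE capping_type = ? AND size = ? "
        "AND d_bin = ? AND delta_bin = ? ORDER BY cluster_id",
        (capping_type, size, d_bin, delta_bin),
    )
    return [row[0] for row in rows]
```

Parameterized queries are used so that a user-supplied PDB code
is never interpolated directly into the SQL text.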



CLUSTER INFORMATION 

The clusters of the cappings are presented in individual web 
pages that consist of two main components: a general 
information section and a table with alignment of the 
cappings (Figure 4). 

The general section provides the following information:
the type of capping (Nt or Ct), the capping size and the
number of capping-motifs included in the cluster; the
consensus geometry; the sequence, Ramachandran
[(φ,ψ)-conformational space] and secondary
structure consensus; the average sequence identity among
cappings included in the cluster; the average RMSD
between capping regions (calculated using main-chain
atoms); a PROSITE-like sequence pattern and the pattern
entropy. The consensus sequence was derived from the
aligned cappings by selecting the amino acid type if conserved
>90%, p for polar residues [D,E,H,K,N,Q,R,S,T,Y] or h
for non-polar residues [A,C,F,G,I,L,M,P,V,W,Y] if
conserved in at least 75% of the sequences; 'X' otherwise.
Ramachandran and secondary structure consensus were
derived similarly. The PROSITE-like sequence pattern
was derived from the weighted [as in Henikoff and
Henikoff (31)] amino acid frequencies derived from the
alignment, where amino acid types are shown in lower
case if having an individual frequency between 75% and
90% and in upper case if >90%; 'x' otherwise. The pattern
entropy was calculated using the AL2CO program (32)
with default parameters.
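
The consensus-sequence rule described above can be sketched as
follows. The thresholds and residue classes are those given in
the text; the use of unweighted frequencies, the order in which
the polar and non-polar classes are tested (Y belongs to both)
and the function names are assumptions made for this example.

```python
from collections import Counter

POLAR = set("DEHKNQRSTY")        # polar class as listed in the text
NON_POLAR = set("ACFGILMPVWY")   # non-polar class as listed in the text

def consensus_symbol(column):
    """Consensus symbol for one alignment column (a string of residues)."""
    counts = Counter(column)
    total = len(column)
    residue, top = counts.most_common(1)[0]
    if top / total > 0.90:
        return residue                      # single residue conserved >90%
    if sum(counts[r] for r in POLAR) / total >= 0.75:
        return "p"                          # polar in at least 75%
    if sum(counts[r] for r in NON_POLAR) / total >= 0.75:
        return "h"                          # non-polar in at least 75%
    return "X"

def consensus_sequence(aligned):
    """Consensus of equally long capping sequences, column by column."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all sequences must be non-empty and equally long")
    return "".join(consensus_symbol("".join(col)) for col in zip(*aligned))
```

For example, the four toy sequences SPEA, SAED, SPKL and SGDK
give the consensus ShpX.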

Another important element in the general information
section is the so-called contact patterns matrix. This is a
graphical matrix that shows the proportion of cappings in
the cluster that have a given interaction pair, e.g.
capping-motifs with an atomic interaction between C
and C4 residues [nomenclature of residues adopted from
Aurora and Rose (10)]. This matrix provides a visual
representation of the atomic interactions that are most
important and conserved in the cappings, and thus of the
interaction patterns that are most prevalent among
capping-motifs. Detailed information about the atomic
interaction types including structural visualization is also
provided (see 'Conclusion' section).
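
The proportions shown in the contact patterns matrix can be
computed as in the sketch below. It assumes that the interacting
position pairs of each capping have already been extracted (in
CAPS-DB the atomic interactions are defined with CSU); the input
layout and the function name are hypothetical.

```python
from collections import Counter

def contact_proportions(cluster_contacts):
    """Fraction of cappings in a cluster that show each interaction pair.

    cluster_contacts: one collection per capping, each holding
    (capping_position, helix_position) label pairs.
    """
    n = len(cluster_contacts)
    if n == 0:
        return {}
    counts = Counter()
    for contacts in cluster_contacts:
        counts.update(set(contacts))  # count each pair once per capping
    return {pair: count / n for pair, count in counts.items()}
```

Each value lies between 0 and 1 and corresponds to one cell of
the matrix for that pair of positions.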

Capping-motifs belonging to a cluster are presented in
a table that includes information about the protein structure
of origin (PDB code, chain, start and end residues;
numbering as in the PDB file), sequence, secondary
structure [as defined by DSSP (29)] and Ramachandran conformation.
The contents of the table and other additional files can 
be downloaded by following the relevant links. These 
include the superposed coordinates in compressed format 
(.gz); a PSSM profile; and a tab-delimited plain text file 
detailing the atomic interactions between cappings and 
helix termini. Finally, a Jmol applet allows the visualiza- 
tion of the structure of cappings by selecting them from a 
clickable list (Figure 4), including the atomic interactions 
between cappings and helix. 

CONCLUSION 

Here we present an automatic structural classification of
cappings. The classified cappings presented in this work
have been stored in a MySQL relational database named
CAPS-DB. The clustering approach can recapitulate
structural motifs for classical cappings such as the big-box,
αL-motif or Schellman (10). Users can browse among
905 different types of capping conformations extracted
from a non-redundant set of 3848 protein structures.
Additionally, the database can be queried using
a BLAST sequence search, a functional-keyword search
or any of the topological features used for the clustering.
The information retrieved from the database
includes the sequence profile and the conservation
of specific residues. Finally, links are provided to the
original PDB structures and to the visualization of cappings
using Jmol.

Figure 3. Screenshot of a CAPS-DB ontology browser, search web page and BLAST sequence search engine including a BLAST output and
alignment.

Figure 4. Screenshot of CAPS-DB cluster information web page and Jmol applet.

There are a number of areas where the data compiled in
CAPS-DB can be applied. In comparative modelling and
structural biology, the information classified in CAPS-DB
can be used to define the boundaries of helices by
comparing the sequence of a helix with those classified in
CAPS-DB. It can also be an important resource in protein
design to improve the stability of helices by optimizing Ct
and Nt cappings. Finally, sequence profiles derived from
structural clusters can be used to improve secondary
structure prediction. In summary, CAPS-DB is a useful tool
to predict local structural and functional features and
to apply them to optimize models, refine crystal structures,
and design, construct and understand helix boundaries in
proteins.